Abstract: In this paper, we determine which non-random sampling of fixed
size gives the best linear predictor of the sum of a finite spatial population. 
We employ different multiscale superpopulation models and use the minimum 
mean-squared error as our optimality criterion. In multiscale superpopulation 
tree models, the leaves represent the units of the population, interior nodes 
represent partial sums of the population, and the root node represents the 
total sum of the population. We prove that the optimal sampling pattern varies 
dramatically with the correlation structure of the tree nodes. While uniform 
sampling is optimal for trees with "positive correlation progression," it provides
the worst possible sampling with "negative correlation progression." As an 
analysis tool, we introduce and study a class of independent innovations trees 
that are of interest in their own right. We derive a fast water-filling algorithm 
to determine the optimal sampling of the leaves to estimate the root of an 
independent innovations tree. 



1. Introduction 

In this paper we design optimal sampling strategies for spatial populations under 
different multiscale superpopulation models. Spatial sampling plays an important 
role in a number of disciplines, including geology, ecology, and environmental
science. See, e.g., Cressie [5].



1.1. Optimal spatial sampling 

Consider a finite population consisting of a rectangular grid of $R \times C$ units as
depicted in Fig. 1(a). Associated with the unit in the $i$-th row and $j$-th column is
an unknown value $\ell_{i,j}$. We treat the $\ell_{i,j}$'s as one realization of a superpopulation
model.

Our goal is to determine which sample, among all samples of size $n$, gives the
best linear estimator of the population sum, $S := \sum_{i,j} \ell_{i,j}$. We abbreviate variance,
covariance, and expectation by "var", "cov", and "E" respectively. Without loss of
generality we assume that $E(\ell_{i,j}) = 0$ for all locations $(i, j)$.

Fig 1. (a) Finite population on a spatial rectangular grid of size $R \times C$ units. Associated with
the unit at position $(i, j)$ is an unknown value $\ell_{i,j}$. (b) Multiscale superpopulation model for a
finite population. Nodes at the bottom are called leaves and the topmost node the root. Each leaf
node corresponds to one value $\ell_{i,j}$. All nodes, except for the leaves, correspond to the sum of their
children at the next lower level.



Denote an arbitrary sample of size $n$ by $L$. We consider linear estimators of $S$
that take the form

(1.1) $\widehat{S}(L, \alpha) := \alpha^T L$,

where $\alpha$ is an arbitrary set of coefficients. We measure the accuracy of $\widehat{S}(L, \alpha)$ in
terms of the mean-squared error (MSE)

(1.2) $\mathcal{E}(S|L, \alpha) := E\big(S - \widehat{S}(L, \alpha)\big)^2$

and define the linear minimum mean-squared error (LMMSE) of estimating $S$ from
$L$ as

(1.3) $\mathcal{E}(S|L) := \min_{\alpha} \mathcal{E}(S|L, \alpha)$.

Restated, our goal is to determine

(1.4) $L^* := \arg\min_{L} \mathcal{E}(S|L)$.

Our results are particularly applicable to Gaussian processes for which linear esti- 
mation is optimal in terms of mean-squared error. We note that for certain multi- 
modal and discrete processes linear estimation may be sub-optimal. 
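
As an illustration of (1.1)-(1.3), the following minimal sketch computes the LMMSE coefficients and error for one given sample, assuming zero-mean variables and a known, invertible covariance matrix. The function name `lmmse` and the toy population are assumptions made for this example only.

```python
import numpy as np

def lmmse(cov_LL, cov_LS, var_S):
    """Best linear estimator alpha^T L of S and its mean-squared error.

    cov_LL: (n, n) covariance matrix of the sampled values L (assumed invertible)
    cov_LS: (n,) vector of covariances cov(L_i, S)
    var_S:  variance of the population sum S
    """
    cov_LL = np.asarray(cov_LL, dtype=float)
    cov_LS = np.asarray(cov_LS, dtype=float)
    alpha = np.linalg.solve(cov_LL, cov_LS)   # minimizer of (1.2) over alpha
    mse = float(var_S - cov_LS @ alpha)       # value of (1.3) for this sample
    return alpha, mse

# Hypothetical toy population: four uncorrelated, unit-variance units,
# of which two are observed. Then var(S) = 4 and cov(L_i, S) = 1.
alpha, mse = lmmse(np.eye(2), np.ones(2), 4.0)
print(alpha, mse)
```

Solving (1.4) by brute force would repeat this computation for every candidate sample of size $n$, which is what the structured results in the rest of the paper avoid.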



1.2. Multiscale superpopulation models 



We assume that the population is one realization of a multiscale stochastic process
(see Fig. 1(b) and Willsky [20]). Such processes consist of random variables
organized on a tree. Nodes at the bottom, called leaves, correspond to the population
values $\ell_{i,j}$. All nodes, except for the leaves, represent the sum total of their children at
the next lower level. The topmost node, the root, hence represents the sum of the 
entire population. The problem we address in this paper is thus equivalent to the 
following: Among all possible sets of leaves of size n, which set provides the best 
linear estimator for the root in terms of MSE? 

Multiscale stochastic processes efficiently capture the correlation structure of a
wide range of phenomena, from uncorrelated data to complex fractal data. They
do so through a simple probabilistic relationship between each parent node and its
children. They also provide fast algorithms for analysis and synthesis of data and
are often physically motivated. As a result multiscale processes have been used in
a number of fields, including oceanography, hydrology, imaging, physics, computer
networks, and sensor networks (see Willsky [20] and references therein, Riedi et al.
[13], and Willett et al.).

Fig 2. (a) Binary tree for interpolation of Brownian motion, $B(t)$. (b) Form child nodes $V_{\gamma 1}$
and $V_{\gamma 2}$ by adding and subtracting an independent Gaussian random variable $W_\gamma$ from $V_\gamma/2$. (c)
Mid-point displacement. Set $B(1) = V_0$ and form $B(1/2) = (B(1) - B(0))/2 + W_0 = V_{01}$. Then
$B(1) - B(1/2) = V_0/2 - W_0 = V_{02}$. In general a node at scale $j$ and position $k$ from the left of
the tree corresponds to $B((k + 1)2^{-j}) - B(k2^{-j})$.

We illustrate the essentials of multiscale modeling through a tree-based
interpolation of one-dimensional standard Brownian motion. Brownian motion, $B(t)$, is
a zero-mean Gaussian process with $B(0) := 0$ and $\mathrm{var}(B(t)) = t$. Our goal is to
begin with $B(t)$ specified only at $t = 1$ and then interpolate it at all time instants
$t = k2^{-j}$, $k = 1, 2, \ldots, 2^j$ for any given value $j$.

Consider a binary tree as shown in Fig. 2(a). We denote the root by $V_0$. Each
node $V_\gamma$ is the parent of two nodes connected to it at the next lower level, $V_{\gamma 1}$
and $V_{\gamma 2}$, which are called its child nodes. The address $\gamma$ of any node is thus a
concatenation of the form $0k_1k_2 \ldots k_j$, where $j$ is the node's scale or depth in the
tree.

We begin by generating a zero-mean Gaussian random variable with unit variance
and assign this value to the root, $V_0$. The root is now a realization of $B(1)$. We
next interpolate $B(0)$ and $B(1)$ to obtain $B(1/2)$ using a "mid-point displacement"
technique. We generate an independent innovation $W_0$ of variance $\mathrm{var}(W_0) = 1/4$ and
set $B(1/2) = V_0/2 + W_0$ (see Fig. 2(c)).

Random variables of the form $B((k + 1)2^{-j}) - B(k2^{-j})$ are called increments of
Brownian motion at time-scale $j$. We assign the increments of the Brownian motion
at time-scale 1 to the children of $V_0$. That is, we set

(1.5) $V_{01} = B(1/2) - B(0) = V_0/2 + W_0$, and
      $V_{02} = B(1) - B(1/2) = V_0/2 - W_0$
as depicted in Fig. 2(c). We continue the interpolation by repeating the procedure
described above, replacing $V_0$ by each of its children and reducing the variance of
the innovations by half, to obtain $V_{011}$, $V_{012}$, $V_{021}$, and $V_{022}$.

Proceeding in this fashion we go down the tree assigning values to the different
tree nodes (see Fig. 2(b)). It is easily shown that the nodes at scale $j$ are now
realizations of $B((k + 1)2^{-j}) - B(k2^{-j})$, that is, increments at time-scale $j$. For a
given value of $j$ we thus obtain the interpolated values of Brownian motion, $B(k2^{-j})$
for $k = 0, 1, \ldots, 2^j - 1$, by cumulatively summing up the nodes at scale $j$.
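
The mid-point displacement recursion described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy's random generator; the function names `brownian_increments` and `interpolate` are hypothetical and not taken from the paper.

```python
import numpy as np

def brownian_increments(depth, rng):
    """Tree nodes at scale `depth`: increments of B(t) over intervals of length 2**-depth."""
    nodes = rng.normal(0.0, 1.0, size=1)       # root V_0 = B(1), unit variance
    var_w = 0.25                               # var(W_0) = 1/4 at the root
    for _ in range(depth):
        w = rng.normal(0.0, np.sqrt(var_w), size=nodes.size)
        children = np.empty(2 * nodes.size)
        children[0::2] = nodes / 2 + w         # left child:  V/2 + W
        children[1::2] = nodes / 2 - w         # right child: V/2 - W
        nodes = children
        var_w /= 2                             # innovation variance halves per scale
    return nodes

def interpolate(depth, rng):
    """Values B(k 2**-depth), k = 0, ..., 2**depth, by cumulative summation."""
    increments = brownian_increments(depth, rng)
    return np.concatenate(([0.0], np.cumsum(increments)))

path = interpolate(3, np.random.default_rng(0))
print(path)
```

Because each pair of children sums to its parent, the increments at any scale add up to the root value $B(1)$, and the cost is linear in the number of leaves.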

By appropriately setting the variances of the innovations $W_\gamma$, we can use the
procedure outlined above for Brownian motion interpolation to interpolate several
other Gaussian processes (Abry et al. [1], Ma and Ji [12]). One of these is fractional
Brownian motion (fBm), $B_H(t)$ ($0 < H < 1$), that has variance $\mathrm{var}(B_H(t)) = t^{2H}$.
The parameter $H$ is called the Hurst parameter. Unlike the interpolation for
Brownian motion which is exact, however, the interpolation for fBm is only approximate.
By setting the variance of innovations at different scales appropriately we ensure 
that nodes at scale j have the same variance as the increments of fBm at time-scale 
j. However, except for the special case when H = 1/2, the covariance between 
any two arbitrary nodes at scale j is not always identical to the covariance of the 
corresponding increments of fBm at time-scale j. Thus the tree-based interpolation 
captures the variance of the increments of fBm at all time-scales j but does not 
perfectly capture the entire covariance (second-order) structure. 

This approximate interpolation of fBm, nevertheless, suffices for several
applications including network traffic synthesis and queuing experiments (Ma and Ji [12]).
Such tree-based interpolations provide fast $O(N)$ algorithms for both synthesis and
analysis of data sets of size $N$. By assigning multivariate random variables to the tree
nodes $V_\gamma$ as well as innovations $W_\gamma$, the accuracy of the interpolations for fBm can be
further improved (Willsky [20]).

In this paper we restrict our attention to two types of multiscale stochastic
processes: covariance trees and independent innovations trees (Chou et al. [3],
Willsky [20]). In covariance trees the covariance between pairs of leaves is purely a
function of their distance. In independent innovations trees, each node is related to
its parent node through a unique independent additive innovation. One example of a
covariance tree is the multiscale process described above for the interpolation of
Brownian motion (see Fig. 2).

1.3. Summary of results and paper organization

We analyze covariance trees belonging to two broad classes: those with positive
correlation progression and those with negative correlation progression. In trees with
positive correlation progression, leaves closer together are more correlated than
leaves farther apart. The opposite is true for trees with negative correlation
progression. While most spatial data sets are better modeled by trees with positive
correlation progression, there exist several phenomena in finance, computer
networks, and nature that exhibit anti-persistent behavior, which is better modeled
by a tree with negative correlation progression (Li and Mills [11], Kuchment and
Gelfan, Jamdee and Los).

For covariance trees with positive correlation progression we prove that uniformly
spaced leaves are optimal and that clustered leaf nodes provide the worst possible
MSE among all samples of fixed size. The optimal solution can, however, change
with the correlation structure of the tree. In fact, for covariance trees with negative
correlation progression we prove that uniformly spaced leaf nodes give the worst
possible MSE!

In order to prove optimality results for covariance trees we investigate the closely
related independent innovations trees. In these trees, a parent node cannot equal
the sum of its children. As a result they cannot be used as superpopulation models
in the scenario described in Section 1.1. Independent innovations trees are, however,
of interest in their own right. For independent innovations trees we describe an
efficient algorithm, called water-filling, to determine an optimal leaf set of size $n$.
Note that the general problem of determining which $n$ random variables from a
given set provide the best linear estimate of another random variable that is not in
the same set is an NP-hard problem. In contrast, the water-filling algorithm solves
one problem of this type in polynomial time.

The paper is organized as follows. Section 2 describes various multiscale
stochastic processes used in the paper. In Section 3 we describe the water-filling technique
to obtain optimal solutions for independent innovations trees. We then prove
optimal and worst case solutions for covariance trees in Section 4. Through numerical
experiments in Section 5 we demonstrate that optimal solutions for multiscale
processes can vary depending on their topology and correlation structure. We describe
related work on optimal sampling in Section 6. We summarize the paper and discuss
future work in Section 7. The proofs can be found in the Appendix. The pseudo-code
and analysis of the computational complexity of the water-filling algorithm
are available online (Ribeiro et al. [14]).

2. Multiscale stochastic processes 

Trees occur naturally in many applications as an efficient data structure with a
simple dependence structure. Of particular interest are trees which arise from rep- 
resenting and analyzing stochastic processes and time series on different time scales. 
In this section we describe various trees and related background material relevant 
to this paper. 

2.1. Terminology and notation 

A tree is a special graph, i.e., a set of nodes together with a list of pairs of nodes 
which can be pictured as directed edges pointing from one node to another with 
the following special properties (see Fig. 3): (1) There is a unique node called the
root to which no edge points. (2) There is exactly one edge pointing to any node,
with the exception of the root. The starting node of the edge is called the parent 
of the ending node. The ending node is called a child of its parent. (3) The tree is 
connected, meaning that it is possible to reach any node from the root by following 
edges. 

These simple rules imply that there are no cycles in the tree; in particular, there
is exactly one way to reach a node from the root. Consequently, unique addresses
can be assigned to the nodes which also reflect the level of a node in the tree. The
topmost node is the root whose address we denote by 0. Given an arbitrary node
$\gamma$, its child nodes are said to be one level lower in the tree and are addressed by $\gamma k$
($k = 1, 2, \ldots, P_\gamma$), where $P_\gamma > 0$. The address of each node is thus a concatenation
of the form $0k_1k_2 \ldots k_j$, or $k_1k_2 \ldots k_j$ for short, where $j$ is the node's scale or depth
in the tree. The largest scale of any node in the tree is called the depth of the tree.

Fig 3. Notation for multiscale stochastic processes.



Nodes with no child nodes are termed leaves or leaf nodes. As usual, we denote
the number of elements of a set of leaf nodes $L$ by $|L|$. We define the operator $\uparrow$
such that $\gamma k \uparrow = \gamma$. Thus, the operator $\uparrow$ takes us one level higher in the tree to
the parent of the current node. Nodes that can be reached from $\gamma$ by repeated $\uparrow$
operations are called ancestors of $\gamma$. We term $\gamma$ a descendant of all of its ancestors.

The set of nodes and edges formed by $\gamma$ and all its descendants is termed the
tree of $\gamma$. Clearly, it satisfies all rules of a tree. Let $L_\gamma$ denote the subset of $L$ that
belongs to the tree of $\gamma$. Let $N_\gamma$ be the total number of leaves of the tree of $\gamma$.

To every node $\gamma$ we associate a single (univariate) random variable $V_\gamma$. For the
sake of brevity we often refer to $V_\gamma$ as simply "the node $V_\gamma$" rather than "the
random variable associated with node $\gamma$."

2.2. Covariance trees 

Covariance trees are multiscale stochastic processes defined on the basis of the 
covariance between the leaf nodes which is purely a function of their proximity. 
Examples of covariance trees are the Wavelet-domain Independent Gaussian model 
(WIG) and the Multifractal Wavelet Model (MWM) proposed for network traffic 
(Ma and Ji [12], Riedi et al. [13]). Precise definitions follow.

Definition 2.1. The proximity of two leaf nodes is the scale of their lowest common 
ancestor. 

Note that the larger the proximity of a pair of leaf nodes, the closer the nodes 
are to each other in the tree. 
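
Definition 2.1 translates directly into code once leaves are identified by their addresses. The sketch below assumes that a leaf is represented as a tuple $(k_1, \ldots, k_D)$ of child indices with the root prefix omitted; this representation and the function name `proximity` are choices made for this example.

```python
def proximity(leaf_a, leaf_b):
    """Scale of the lowest common ancestor of two leaves.

    Each leaf is a tuple (k_1, ..., k_D) of child indices; the proximity is
    the length of the longest common prefix of the two addresses.
    """
    scale = 0
    for ka, kb in zip(leaf_a, leaf_b):
        if ka != kb:
            break
        scale += 1
    return scale

# Binary tree of depth 2: siblings share a parent at scale 1,
# leaves in different halves of the tree only share the root (scale 0).
print(proximity((1, 1), (1, 2)), proximity((1, 1), (2, 1)))
```

Under this convention a leaf has proximity $D$ with itself.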

Definition 2.2. A covariance tree is a multiscale stochastic process with two
properties. (1) The covariance of any two leaf nodes depends only on their proximity. In
other words, if the leaves $\gamma'$ and $\gamma$ have proximity $k$ then $\mathrm{cov}(V_\gamma, V_{\gamma'}) =: c_k$. (2) All
leaf nodes are at the same scale $D$ and the root is equally correlated with all leaves.

In this paper we consider covariance trees of two classes: trees with positive
correlation progression and trees with negative correlation progression.

Definition 2.3. A covariance tree has a positive correlation progression if $c_k >
c_{k-1} > 0$ for $k = 1, \ldots, D - 1$. A covariance tree has a negative correlation
progression if $c_k < c_{k-1}$ for $k = 1, \ldots, D - 1$.

Intuitively, in trees with positive correlation progression leaf nodes "closer" to
each other in the tree are more strongly correlated than leaf nodes "farther apart."
Our results take on a special form for covariance trees that are also symmetric trees.

Definition 2.4. A symmetric tree is a multiscale stochastic process in which $P_\gamma$,
the number of child nodes of $V_\gamma$, is purely a function of the scale of $\gamma$.

2.3. Independent innovations trees 

Independent innovations trees are particular multiscale stochastic processes defined 
as follows. 

Definition 2.5. An independent innovations tree is a multiscale stochastic process
in which each node $V_\gamma$, excluding the root, is defined through

(2.1) $V_\gamma := \varrho_\gamma V_{\gamma\uparrow} + W_\gamma$.

Here, $\varrho_\gamma$ is a scalar and $W_\gamma$ is a random variable independent of $V_{\gamma\uparrow}$ as well as of
$W_{\gamma'}$ for all $\gamma' \neq \gamma$. The root, $V_0$, is independent of $W_\gamma$ for all $\gamma$. In addition $\varrho_\gamma \neq 0$,
$\mathrm{var}(W_\gamma) > 0$ $\forall \gamma$ and $\mathrm{var}(V_0) > 0$.

Note that the above definition guarantees that $\mathrm{var}(V_\gamma) > 0$ $\forall \gamma$ as well as the
linear independence of any set of tree nodes.

The fact that each node is the sum of a scaled version of its parent and an
independent random variable makes these trees amenable to analysis (Chou et al.
[3], Willsky [20]). We prove optimality results for independent innovations trees in
Section 3. Our results take on a special form for scale-invariant trees defined below.

Definition 2.6. A scale-invariant tree is an independent innovations tree which
is symmetric and where $\varrho_\gamma$ and the distribution of $W_\gamma$ are purely functions of the
scale of $\gamma$.

While independent innovations trees are not covariance trees in general, it is easy 
to see that scale-invariant trees are indeed covariance trees with positive correlation 
progression. 

3. Optimal leaf sets for independent innovations trees 

In this section we determine the optimal leaf sets of independent innovations trees 
to estimate the root. We first describe the concept of water-filling which we later 
use to prove optimality results. We also outline an efficient numerical method to 
obtain the optimal solutions. 

3.1. Water-filling 

While obtaining optimal sets of leaves to estimate the root we maximize a sum of 
concave functions under certain constraints. We now develop the tools to solve this 
problem. 



A set of random variables is linearly independent if none of them can be written as a linear
combination of finitely many other random variables in the set.

Definition 3.1. A real function $\psi$ defined on the set of integers $\{0, 1, \ldots, M\}$ is
discrete-concave if

(3.1) $\psi(x + 1) - \psi(x) \geq \psi(x + 2) - \psi(x + 1)$, for $x = 0, 1, \ldots, M - 2$.

The optimization problem we are faced with can be cast as follows. Given integers
$P \geq 2$, $M_k > 0$ ($k = 1, \ldots, P$) and $n \leq \sum_{k=1}^{P} M_k$, consider the discrete space

(3.2) $\Delta_n(M_1, \ldots, M_P) := \big\{ X = [x_k]_{k=1}^{P} : \sum_{k=1}^{P} x_k = n;\ x_k \in \{0, 1, \ldots, M_k\},\ \forall k \big\}$.

Given non-decreasing, discrete-concave functions $\psi_k$ ($k = 1, \ldots, P$) with domains
$\{0, \ldots, M_k\}$ we are interested in

(3.3) $h(n) := \max \big\{ \sum_{k=1}^{P} \psi_k(x_k) : X \in \Delta_n(M_1, \ldots, M_P) \big\}$.

In the context of optimal estimation on a tree, $P$ will play the role of the number of
children that a parent node $V_\gamma$ has, $M_k$ the total number of leaf node descendants
of the $k$-th child $V_{\gamma k}$, and $\psi_k$ the reciprocal of the optimal LMMSE of estimating
$V_\gamma$ given $x_k$ leaf nodes in the tree of $V_{\gamma k}$. The quantity $h(n)$ corresponds to the
reciprocal of the optimal LMMSE of estimating node $V_\gamma$ given $n$ leaf nodes in its
tree.

The following iterative procedure solves the optimization problem (3.3). Form
vectors $G^{(n)} = [g_k^{(n)}]_{k=1}^{P}$, $n = 0, \ldots, \sum_k M_k$ as follows:
Step (i): Set $g_k^{(0)} = 0$, $\forall k$.
Step (ii): Set

(3.4) $g_k^{(n+1)} = g_k^{(n)} + 1$ if $k = m$, and $g_k^{(n+1)} = g_k^{(n)}$ if $k \neq m$,

where

(3.5) $m \in \arg\max_k \big\{ \psi_k(g_k^{(n)} + 1) - \psi_k(g_k^{(n)}) : g_k^{(n)} < M_k \big\}$.

The procedure described in Steps (i) and (ii) is termed water-filling because it 
resembles the solution to the problem of filling buckets with water to maximize the 
sum of the heights of the water levels. These buckets are narrow at the bottom 
and monotonically widen towards the top. Initially all buckets are empty (compare 
Step (i)). At each step we are allowed to pour one unit of water into any one bucket 
with the goal of maximizing the sum of water levels. Intuitively at any step we 
must pour the water into that bucket which will give the maximum increase in 
water level among all the buckets not yet full (compare Step (ii)). Variants of this
water-filling procedure appear as solutions to different information theoretic and
communication problems (Cover and Thomas).
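
Steps (i) and (ii) can be sketched as a greedy loop. The code below is a minimal illustration rather than the authors' pseudo-code; the function name `water_fill`, the tie-breaking rule (first maximizer wins), and the example functions are assumptions made here.

```python
import math

def water_fill(psis, caps, n):
    """Greedy water-filling for non-decreasing, discrete-concave functions.

    psis: list of callables, psis[k](x) defined for x in {0, ..., caps[k]}
    caps: list of capacities M_k
    n:    number of units to distribute, n <= sum(caps)
    Returns the allocation [g_1, ..., g_P] after n pours.
    """
    if n > sum(caps):
        raise ValueError("n exceeds the total capacity")
    g = [0] * len(psis)                          # Step (i): all buckets empty
    for _ in range(n):                           # Step (ii): one unit at a time
        best_k, best_gain = None, None
        for k, (psi, cap) in enumerate(zip(psis, caps)):
            if g[k] < cap:                       # only buckets not yet full
                gain = psi(g[k] + 1) - psi(g[k])
                if best_gain is None or gain > best_gain:
                    best_k, best_gain = k, gain
        g[best_k] += 1
    return g

# Two hypothetical buckets with concave gains sqrt(x) and 2*sqrt(x).
psis = [lambda x: math.sqrt(x), lambda x: 2.0 * math.sqrt(x)]
g = water_fill(psis, [3, 3], 3)
print(g)
```

Each pour scans all $P$ buckets, so distributing $n$ units costs on the order of $nP$ function evaluations in this naive form.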

Lemma 3.1. The function $h(n)$ is non-decreasing and discrete-concave. In
addition,

(3.6) $h(n) = \sum_{k} \psi_k\big(g_k^{(n)}\big)$,

where $g_k^{(n)}$ is defined through water-filling.

When all functions $\psi_k$ in Lemma 3.1 are identical, the maximum of $\sum_k \psi_k(x_k)$
is achieved by choosing the $x_k$'s to be "near-equal". The following Corollary states
this rigorously.

Corollary 3.1. If $\psi_k = \psi$ for all $k = 1, 2, \ldots, P$ with $\psi$ non-decreasing and
discrete-concave, then

(3.7) $h(n) = \big(P - n + P\lfloor n/P \rfloor\big)\, \psi\big(\lfloor n/P \rfloor\big) + \big(n - P\lfloor n/P \rfloor\big)\, \psi\big(\lfloor n/P \rfloor + 1\big)$.

The maximizing values of the $x_k$ are apparent from (3.7). In particular, if $n$ is a
multiple of $P$ then this reduces to

(3.8) $h(n) = P\, \psi(n/P)$.

Corollary 3.1 is key to proving our results for scale-invariant trees.

3.2. Optimal leaf sets through recursive water-filling 

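Our goal is to determine a choice of $n$ leaf nodes that gives the smallest possible
LMMSE of the root. Recall that the LMMSE of $V_\gamma$ given $L_\gamma$ is defined as

(3.9) $\mathcal{E}(V_\gamma|L_\gamma) := \min_{\alpha} E\big(V_\gamma - \alpha^T L_\gamma\big)^2$,

where, in an abuse of notation, $\alpha^T L_\gamma$ denotes a linear combination of the elements
of $L_\gamma$ with coefficients $\alpha$. Crucial to our proofs is the fact that (Chou et al. [3] and
Willsky [20]),

(3.10) $\dfrac{1}{\mathcal{E}(V_\gamma|L_\gamma)} + \dfrac{P_\gamma - 1}{\mathrm{var}(V_\gamma)} = \sum_{k=1}^{P_\gamma} \dfrac{1}{\mathcal{E}(V_\gamma|L_{\gamma k})}$.

Denote the set consisting of all subsets of leaves of the tree of $\gamma$ of size $n$ by $\Lambda_\gamma(n)$.
Motivated by (3.10) we introduce

(3.11) $\mu_\gamma(n) := \max_{L \in \Lambda_\gamma(n)} \mathcal{E}(V_\gamma|L)^{-1}$

and define

(3.12) $\mathcal{L}_\gamma(n) := \{L \in \Lambda_\gamma(n) : \mathcal{E}(V_\gamma|L)^{-1} = \mu_\gamma(n)\}$.

Restated, our goal is to determine one element of $\mathcal{L}_0(n)$. To allow a recursive
approach through scale we generalize (3.11) and (3.12) by defining

(3.13) $\mu_{\gamma,\gamma'}(n) := \max_{L \in \Lambda_{\gamma'}(n)} \mathcal{E}(V_\gamma|L)^{-1}$ and

(3.14) $\mathcal{L}_{\gamma,\gamma'}(n) := \{L \in \Lambda_{\gamma'}(n) : \mathcal{E}(V_\gamma|L)^{-1} = \mu_{\gamma,\gamma'}(n)\}$.

Of course, $\mathcal{L}_\gamma(n) = \mathcal{L}_{\gamma,\gamma}(n)$. For the recursion, we are mostly interested in $\mathcal{L}_{\gamma,\gamma k}(n)$,
i.e., the optimal estimation of a parent node from a sample of leaf nodes of one of
its children. The following will be useful notation

(3.15) $X^* = [x_k^*]_{k=1}^{P_\gamma} \in \arg\max_{X \in \Delta_n(N_{\gamma 1}, \ldots, N_{\gamma P_\gamma})} \sum_{k=1}^{P_\gamma} \mu_{\gamma,\gamma k}(x_k)$.
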
Using (3.10) we can decompose the problem of determining $L \in \mathcal{L}_\gamma(n)$ into
smaller problems of determining elements of $\mathcal{L}_{\gamma,\gamma k}(x_k^*)$ for all $k$ as stated in the
next theorem.

Theorem 3.1. For an independent innovations tree, let there be given one leaf set
$L^{(k)}$ belonging to $\mathcal{L}_{\gamma,\gamma k}(x_k^*)$ for all $k$. Then $\bigcup_{k=1}^{P_\gamma} L^{(k)} \in \mathcal{L}_\gamma(n)$. Moreover, $\mathcal{L}_{\gamma k}(n) =
\mathcal{L}_{\gamma k,\gamma k}(n) = \mathcal{L}_{\gamma,\gamma k}(n)$. Also $\mu_{\gamma,\gamma k}(n)$ is a positive, non-decreasing, and
discrete-concave function of $n$, $\forall k, \gamma$.

Theorem 3.1 gives us a two step procedure to obtain the best set of $n$ leaves in
the tree of $\gamma$ to estimate $V_\gamma$. We first obtain the best set of $x_k^*$ leaves in the tree of
$\gamma k$ to estimate $V_{\gamma k}$ for all children of $\gamma$. We then take the union of these sets of
leaves to obtain the required optimal set.

By sub-dividing the problem of obtaining optimal leaf nodes into smaller
sub-problems we arrive at the following recursive technique to construct $L \in \mathcal{L}_\gamma(n)$.
Starting at $\gamma$ we move downward determining how many of the $n$ leaf nodes of
$L \in \mathcal{L}_\gamma(n)$ lie in the trees of the different descendants of $\gamma$ until we reach the
bottom. Assume for the moment that the functions $\mu_{\gamma,\gamma k}(n)$, for all $\gamma$, are given.

Scale-Recursive Water-filling scheme $\gamma \to \gamma k$:

Step (a): Split $n$ leaf nodes between the trees of $\gamma k$, $k = 1, 2, \ldots, P_\gamma$.
First determine how to split the $n$ leaf nodes between the trees of $\gamma k$ by maximizing
$\sum_{k=1}^{P_\gamma} \mu_{\gamma,\gamma k}(x_k)$ over $X \in \Delta_n(N_{\gamma 1}, \ldots, N_{\gamma P_\gamma})$ (see (3.15)). The split is given by
$X^*$ which is easily obtained using the water-filling procedure for discrete-concave
functions (defined in (3.4)) since $\mu_{\gamma,\gamma k}(n)$ is discrete-concave for all $k$. Determine
$L^{(k)} \in \mathcal{L}_{\gamma,\gamma k}(x_k^*)$ since $L = \bigcup_{k=1}^{P_\gamma} L^{(k)} \in \mathcal{L}_\gamma(n)$.

Step (b): Split $x_k^*$ nodes between the trees of child nodes of $\gamma k$.
It turns out that $L^{(k)} \in \mathcal{L}_{\gamma,\gamma k}(x_k^*)$ if and only if $L^{(k)} \in \mathcal{L}_{\gamma k}(x_k^*)$. Thus repeat
Step (a) with $\gamma = \gamma k$ and $n = x_k^*$ to construct $L^{(k)}$. Stop when we have reached
the bottom of the tree.

We outline an efficient implementation of the scale-recursive water-filling
algorithm. This implementation first computes $L \in \mathcal{L}_\gamma(n)$ for $n = 1$ and then
inductively obtains the same for larger values of $n$. Given $L \in \mathcal{L}_\gamma(n)$ we obtain
$L \in \mathcal{L}_\gamma(n + 1)$ as follows. Note from Step (a) above that we determine how to
split the $n$ leaves at $\gamma$. We are now required to split $n + 1$ leaves at $\gamma$. We easily
obtain this from the earlier split of $n$ leaves using (3.4). The water-filling technique
maintains the split of $n$ leaf nodes at $\gamma$ while adding just one leaf node to the tree
of one of the child nodes (say $\gamma k'$) of $\gamma$. We thus have to perform Step (b) only
for $k = k'$. In this way the new leaf node "percolates" down the tree until we find
its location at the bottom of the tree. The pseudo-code for determining $L \in \mathcal{L}_\gamma(n)$
given $\mathrm{var}(W_\gamma)$ for all $\gamma$ as well as the proof that the recursive water-filling algorithm
can be computed in polynomial time are available online (Ribeiro et al. [14]).

3.3. Uniform leaf nodes are optimal for scale-invariant trees 

The symmetry in scale-invariant trees forces the optimal solution to take a particular
form irrespective of the variances of the innovations $W_\gamma$. We use the following
notion of uniform split to prove that in a scale-invariant tree a more or less equal 
spread of sample leaf nodes across the tree gives the best linear estimate of the 
root. 

Definition 3.2. Given a scale-invariant tree, a vector of leaf nodes $L$ has uniform
split of size $n$ at node $\gamma$ if $|L_\gamma| = n$ and $|L_{\gamma k}|$ is either $\lfloor n/P_\gamma \rfloor$ or $\lfloor n/P_\gamma \rfloor + 1$ for all
values of $k$. It follows that $\#\{k : |L_{\gamma k}| = \lfloor n/P_\gamma \rfloor + 1\} = n - P_\gamma \lfloor n/P_\gamma \rfloor$.

Definition 3.3. Given a scale-invariant tree, a vector of leaf nodes is called a 
uniform leaf sample if it has a uniform split at all tree nodes. 
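
One way to construct a uniform leaf sample is to apply the uniform split recursively from the root down. The sketch below assumes a symmetric tree described by a list `branching` giving the number of children at each scale, and it hands the extra leaves to the first children at every node; the definitions above leave that choice open, so this is only one of several valid uniform leaf samples. The function name `uniform_leaf_sample` is hypothetical.

```python
def uniform_leaf_sample(n, branching, prefix=()):
    """Return n leaf addresses forming a uniform leaf sample.

    branching: list [P_0, P_1, ..., P_{D-1}], number of children per node at
               each scale of a symmetric tree (an assumption of this sketch)
    prefix:    address of the current node as a tuple of child indices
    """
    if not branching:                            # reached a leaf
        if n > 1:
            raise ValueError("more samples requested than leaves available")
        return [prefix] if n == 1 else []
    P = branching[0]
    base, extra = divmod(n, P)                   # floor(n/P) and n - P*floor(n/P)
    leaves = []
    for k in range(1, P + 1):
        share = base + (1 if k <= extra else 0)  # first `extra` children get one more
        leaves += uniform_leaf_sample(share, branching[1:], prefix + (k,))
    return leaves

# Binary tree of depth 2 (four leaves), sample of size 3.
sample = uniform_leaf_sample(3, [2, 2])
print(sample)
```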

The next theorem gives the optimal leaf node set for scale-invariant trees. 

Theorem 3.2. Given a scale-invariant tree, the uniform leaf sample of size $n$ gives
the best LMMSE estimate of the tree-root among all possible choices of $n$ leaf nodes.

Proof. For a scale-invariant tree, $\mu_{\gamma,\gamma k}(n)$ is identical for all $k$ given any location $\gamma$.
Corollary 3.1 and Theorem 3.1 then prove the theorem. □

4. Covariance trees 

In this section we prove optimal and worst case solutions for covariance trees. For 
the optimal solutions we leverage our results for independent innovations trees and 
for the worst case solutions we employ eigenanalysis. We begin by formulating the 
problem. 

4.1. Problem formulation

Let us compute the LMMSE of estimating the root $V_0$ given a set of leaf nodes $L$
of size $n$. Recall that for a covariance tree the correlation between any leaf node
and the root node is identical. We denote this correlation by $\rho$. Denote an $i \times j$
matrix with all elements equal to 1 by $1_{i \times j}$. It is well known (Stark and Woods
[17]) that the optimal linear estimate of $V_0$ given $L$ (assuming zero-mean random
variables) is given by $\rho 1_{1 \times n} Q_L^{-1} L$, where $Q_L$ is the covariance matrix of $L$ and that
the resulting LMMSE is

(4.1) $\mathcal{E}(V_0|L) = \mathrm{var}(V_0) - \mathrm{cov}(L, V_0)^T Q_L^{-1} \mathrm{cov}(L, V_0)$
      $= \mathrm{var}(V_0) - \rho^2 1_{1 \times n} Q_L^{-1} 1_{n \times 1}$.

Clearly obtaining the best and worst-case choices for $L$ is equivalent to maximizing
and minimizing the sum of the elements of $Q_L^{-1}$. The exact value of $\rho$ does not
affect the solution. We assume that no element of $L$ can be expressed as a linear
combination of the other elements of $L$ which implies that $Q_L$ is invertible.
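
The selection criterion above, the sum of the elements of $Q_L^{-1}$, is easy to evaluate numerically for a candidate leaf set. The following sketch assumes leaves are given as address tuples and that the list `c` holds the covariance $c_k$ at proximity $k$, with the last entry `c[D]` taken to be the common leaf variance; these conventions and the names `proximity` and `inv_cov_sum` are assumptions of this example.

```python
import numpy as np

def proximity(a, b):
    """Scale of the lowest common ancestor of two leaf addresses (tuples)."""
    scale = 0
    for ka, kb in zip(a, b):
        if ka != kb:
            break
        scale += 1
    return scale

def inv_cov_sum(leaves, c):
    """Sum of the elements of Q_L^{-1} for the leaf set `leaves`.

    c[k] is the covariance of two leaves with proximity k; c[D] is the
    leaf variance (a leaf has proximity D with itself).
    """
    n = len(leaves)
    Q = np.array([[c[proximity(a, b)] for b in leaves] for a in leaves], dtype=float)
    ones = np.ones(n)
    return float(ones @ np.linalg.solve(Q, ones))

# Binary tree of depth 2 with positive correlation progression (c_0 < c_1).
c = [1.0, 2.0, 3.0]
uniform = inv_cov_sum([(1, 1), (2, 1)], c)     # one leaf in each half of the tree
clustered = inv_cov_sum([(1, 1), (1, 2)], c)   # two siblings
print(uniform, clustered)
```

A larger value corresponds to a smaller LMMSE, so in this toy case the spread-out pair is the better sample.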

4.2. Optimal solutions

We use our results of Section 3 for independent innovations trees to determine the
optimal solutions for covariance trees. Note from (4.1) that the estimation error for
a covariance tree is a function only of the covariance between leaf nodes. Exploiting 
this fact, we first construct an independent innovations tree whose leaf nodes have 
the same correlation structure as that of the covariance tree and then prove that 
both trees must have the same optimal solution. Previous results then provide the 
optimal solution for the independent innovations tree which is also optimal for the 
covariance tree. 

Definition 4.1. A matched innovations tree of a given covariance tree with
positive correlation progression is an independent innovations tree with the following
properties. It has (1) the same topology and (2) the same correlation structure
between leaf nodes as the covariance tree, and (3) the root is equally correlated with
all leaf nodes (though the exact value of the correlation between the root and a leaf
node may differ from that of the covariance tree).

All covariance trees with positive correlation progression have corresponding
matched innovations trees. We construct a matched innovations tree for a given
covariance tree as follows. Consider an independent innovations tree with the same
topology as the covariance tree. Set $\varrho_\gamma = 1$ for all $\gamma$,

(4.2) $\mathrm{var}(V_0) = c_0$

and

(4.3) $\mathrm{var}(W^{(j)}) = c_j - c_{j-1}$, $j = 1, 2, \ldots, D$,

where $c_j$ is the covariance of leaf nodes of the covariance tree with proximity $j$
and $\mathrm{var}(W^{(j)})$ is the common variance of all innovations of the independent
innovations tree at scale $j$. Call $c'_j$ the covariance of leaf nodes with proximity $j$ in the
independent innovations tree. From (2.1) we have

(4.4) $c'_j = \mathrm{var}(V_0) + \sum_{k=1}^{j} \mathrm{var}(W^{(k)})$, $j = 1, \ldots, D$.

Thus, $c'_j = c_j$ for all $j$ and hence this independent innovations tree is the required
matched innovations tree.
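
The construction in (4.2)-(4.4) can be checked numerically. The sketch below assumes the covariances are supplied as a list `c = [c_0, ..., c_D]` with strictly increasing entries, so that every innovation variance is positive; the function names are hypothetical.

```python
def matched_innovation_variances(c):
    """Root variance and per-scale innovation variances, following (4.2)-(4.3)."""
    var_root = c[0]
    var_w = [c[j] - c[j - 1] for j in range(1, len(c))]
    if var_root <= 0 or any(v <= 0 for v in var_w):
        raise ValueError("requires 0 < c_0 < c_1 < ... < c_D")
    return var_root, var_w

def leaf_covariances(var_root, var_w):
    """Covariances c'_j of the innovations tree with unit scaling, as in (4.4)."""
    out = [var_root]
    for v in var_w:
        out.append(out[-1] + v)                  # c'_j = c'_{j-1} + var(W^(j))
    return out

c = [1.0, 2.0, 3.0]
var_root, var_w = matched_innovation_variances(c)
print(var_root, var_w, leaf_covariances(var_root, var_w))
```

Because the innovation variances depend only on the scale, the resulting tree is scale-invariant whenever the topology is symmetric.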

The next lemma relates the optimal solutions of a covariance tree and its matched 
innovations tree. 

Lemma 4.1. A covariance tree with positive correlation progression and its matched
innovations tree have the same optimal leaf sets.

Proof. Note that (4.1) applies to any tree whose root is equally correlated with
all its leaves. This includes both the covariance tree and its matched innovations
tree. From (4.1) we see that the choice of $L$ that maximizes the sum of elements of
$Q_L^{-1}$ is optimal. Since $Q_L^{-1}$ is identical for both the covariance tree and its matched
innovations tree for any choice of $L$, they must have the same optimal solution. □

For a symmetric covariance tree that has positive correlation progression, the op- 
timal solution takes on a specific form irrespective of the actual covariance between 
leaf nodes. 

Theorem 4.1. Given a symmetric covariance tree that has positive correlation 
progression, the uniform leaf sample of size n gives the best LMMSE of the tree- 
root among all possible choices of n leaf nodes. 

Proof. Form a matched innovations tree using the procedure outlined previously.
This tree is by construction a scale-invariant tree. The result then follows from
Theorem 3.2 and Lemma 4.1. □

While the uniform leaf sample is the optimal solution for a symmetric covariance 
tree with positive correlation progression, it is surprisingly the worst case solution 
for certain trees with a different correlation structure, which we prove next.

4.3. Worst case solutions

The worst case solution is any choice of $L \in \Lambda_0(n)$ that maximizes $\mathcal{E}(V_0|L)$. We now
highlight the fact that the best and worst case solutions can change dramatically
depending on the correlation structure of the tree. Of particular relevance to our 
discussion is the set of clustered leaf nodes defined as follows. 

Definition 4.2. The set consisting of all leaf nodes of the tree of $V_\gamma$ is called the
set of clustered leaves of $\gamma$.

We provide the worst case solutions for covariance trees in which every node 
(with the exception of the leaves) has the same number of child nodes. The following 
theorem summarizes our results. 

Theorem 4.2. Consider a covariance tree of depth $D$ in which every node (excluding 
the leaves) has the same number of child nodes $\sigma$. Then for leaf sets of size 
$\sigma^p$, $p = 0, 1, \ldots, D$, the worst case solution when the tree has positive correlation 
progression is given by the sets of clustered leaves of $\gamma$, where $\gamma$ is any node at scale 
$D - p$. The worst case solution is given by the sets of uniform leaf nodes when the 
tree has negative correlation progression. 

Theorem 4.2 gives us the intuition that "more correlated" leaf nodes give worse 
estimates of the root. In the case of covariance trees with positive correlation pro- 
gression, clustered leaf nodes are strongly correlated when compared to uniform leaf 
nodes. The opposite is true in the negative correlation progression case. Essentially 
if leaf nodes are highly correlated then they contain more redundant information 
which leads to poor estimation of the root. 
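
This intuition can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions, not the authors' code: it assumes a binary covariance tree of depth 3, a hypothetical leaf covariance `c[m]` that depends only on the proximity `m` of two leaves (with `c[depth]` the leaf variance), a root with the same covariance `rho` with every leaf, and the LMMSE written as var(root) - rho^2 * (sum of the elements of the inverse leaf covariance matrix). All function names and numeric values are chosen for illustration.

```python
from itertools import combinations

import numpy as np


def proximity(i, j, depth):
    """Scale of the lowest common ancestor of leaves i and j in a binary tree."""
    if i == j:
        return depth
    # Leaves whose depth-bit labels share the first m bits share an ancestor at scale m.
    return depth - (i ^ j).bit_length()


def leaf_covariance(leaves, c, depth):
    """Covariance matrix of the chosen leaves; entries depend only on proximity."""
    return np.array([[c[proximity(i, j, depth)] for j in leaves] for i in leaves])


def lmmse(leaves, c, depth, rho=0.2, var_root=1.0):
    """LMMSE of the root when it has covariance rho with every leaf (assumed values)."""
    q = leaf_covariance(leaves, c, depth)
    ones = np.ones(len(leaves))
    return var_root - rho**2 * ones @ np.linalg.solve(q, ones)


def best_and_worst(n, c, depth):
    """Exhaustive search over all leaf sets of size n (feasible only for tiny trees)."""
    sets = list(combinations(range(2**depth), n))
    errors = [lmmse(s, c, depth) for s in sets]
    return sets[int(np.argmin(errors))], sets[int(np.argmax(errors))]


DEPTH = 3
POSITIVE = [0.1, 0.3, 0.6, 1.0]  # covariance grows with proximity
NEGATIVE = [0.3, 0.2, 0.1, 1.0]  # covariance shrinks with proximity (leaf variance 1.0)

if __name__ == "__main__":
    for name, c in (("positive", POSITIVE), ("negative", NEGATIVE)):
        best, worst = best_and_worst(2, c, DEPTH)
        print(name, "best:", best, "worst:", worst)
```

For leaf sets of size 2 the search space has only three kinds of pairs (proximity 0, 1, or 2), so the ordering of the LMMSE values follows directly from the assumed covariances.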

While we have proved the optimal solution for covariance trees with positive 
correlation progression, we have not yet proved the same for those with negative 
correlation progression. Based on the intuition just gained we make the following 
conjecture. 

Conjecture 4.1. Consider a covariance tree of depth $D$ in which every node (excluding 
the leaves) has the same number of child nodes $\sigma$. Then for leaf sets of 
size $\sigma^p$, $p = 0, 1, \ldots, D$, the optimal solution when the tree has negative correlation 
progression is given by the sets of clustered leaves of $\gamma$, where $\gamma$ is any node at scale 
$D - p$. 

Using numerical techniques we support this conjecture in the next section. 

5. Numerical results 

In this section, using the scale-recursive water-filling algorithm we evaluate the 
optimal leaf sets for independent innovations trees that are not scale-invariant. In 
addition we provide numerical support for Conjecture 4.1. 

5.1. Independent innovations trees: scale-recursive water-filling 

We consider trees with depth $D = 3$ and in which all nodes have at most two child 
nodes. The results demonstrate that the optimal leaf sets are a function of the 
correlation structure and topology of the multiscale trees. 

In Fig. 4(a) we plot the optimal leaf node sets of different sizes for a scale- 
invariant tree. As expected the uniform leaf node sets are optimal. 

[Figure 4: (a) Scale-invariant tree, (b) Tree with unbalanced variance, (c) Tree with missing leaves. Each panel shows the optimal leaf sets for leaf set sizes 1 to 6.] 

Fig 4. Optimal leaf node sets for three different independent innovations trees: (a) scale-invariant 
tree, (b) symmetric tree with unbalanced variance of innovations at scale 1, and (c) tree with 
missing leaves at the finest scale. Observe that the uniform leaf node sets are optimal in (a) as 
expected. In (b), however, the nodes on the left half of the tree are preferable to those on 
the right. In (c) the solution is similar to (a) for optimal sets of size n = 5 or lower but changes 
for n = 6 due to the missing nodes. 

We consider a symmetric tree in Fig. 4(b), that is, a tree in which all nodes have 
the same number of children (excepting leaf nodes). All parameters are constant 
within each scale except for the variance of the innovations $W_\gamma$ at scale 1. The 
variance of the innovation on the right side is five times larger than the variance 
of the innovation on the left. Observe that leaves on the left of the tree are now 
preferable to those on the right and hence dominate the optimal sets. Comparing 
this result to Fig. 4(a) we see that the optimal sets are dependent on the correlation 
structure of the tree. 

In Fig. 4(c) we consider the same tree as in Fig. 4(a) with two leaf nodes missing. 
These two leaves do not belong to the optimal leaf sets of size n = 1 to n = 5 in 
Fig. 4(a) but are elements of the optimal set for n = 6. As a result the optimal sets 
of size 1 to 5 in Fig. 4(c) are identical to those in Fig. 4(a) whereas that for n = 6 
differs. This result suggests that the optimal sets depend on the tree topology. 

Our results have important implications for applications because situations arise 
where we must model physical processes using trees with different correlation struc- 
tures and topologies. For example, if the process to be measured is non-stationary 
over space then the multiscale tree may be unbalanced as in Fig. 4(b). In some 
applications it may not be possible to sample at certain locations due to physical 
constraints. We would thus have to exclude certain leaf nodes in our analysis as in 
Fig. 4(c). 

The above experiments with tree-depth D = 3 are "toy-examples" to illustrate 
key concepts. In practice, the water-filling algorithm can solve much larger 
real-world problems with ease. For example, on a Pentium IV machine running Matlab, 
the water-filling algorithm takes 22 seconds to obtain the optimal leaf set of size 
100 to estimate the root of a binary tree with depth 11, that is a tree with 2048 
leaves. 



5.2. Covariance trees: best and worst cases 



This section provides numerical support for Conjecture 4.1 that states that the 
clustered leaf node sets are optimal for covariance trees with negative correlation 
progression. We employ the WIG tree, a covariance tree in which each node has $\sigma = 2$ 
child nodes (Ma and Ji [12]). We provide numerical support for our claim using a 
WIG model of depth $D = 6$ possessing a fractional Gaussian noise-like correlation 
structure corresponding to $H = 0.8$ and $H = 0.3$. To be precise, we choose the 
WIG model parameters such that the variance of nodes at scale $j$ is proportional 
to $2^{-2jH}$ (see Ma and Ji [12] for further details). Note that $H > 0.5$ corresponds to 
positive correlation progression while $H < 0.5$ corresponds to negative correlation 
progression. 

Fig. [5] compares the LMMSE of the estimated root node (normalized by the 
variance of the root) of the uniform and clustered sampling patterns. Since an 
exhaustive search of all possible patterns is very computationally expensive (for 
example there are over $10^{18}$ ways of choosing 32 leaf nodes from among 64) we 
instead compute the LMMSE for 10^ randomly selected patterns. Observe that the 
clustered pattern gives the smallest LMMSE for the tree with negative correlation 
progression in Fig. 5(a), supporting Conjecture 4.1, while the uniform pattern 
gives the smallest LMMSE for the tree with positive correlation progression in Fig. 5(b), 
as stated in Theorem 4.1. As proved in Theorem 4.2, the clustered and uniform 
patterns give the worst LMMSE for the positive and negative correlation progression 
cases respectively. 
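
The size of the search space mentioned above is easy to confirm. The snippet below is only an arithmetic check using Python's standard library (the variable name is an assumption for illustration); it is not part of the original experiments.

```python
import math

# Number of ways to choose 32 of the 64 leaves of a binary tree of depth 6.
num_patterns = math.comb(64, 32)
print(f"{num_patterns:.3e}")
```

A count of this order is what rules out exhaustive search and motivates evaluating only a random subset of patterns.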




Fig 5. Comparison of sampling schemes for a WIG model with (a) negative correlation progression 
and (b) positive correlation progression. Observe that the clustered nodes are optimal in (a) while 
the uniform is optimal in (b). The uniform and the clustered leaf sets give the worst performance 
in (a) and (b) respectively, as expected from our theoretical results. 



Footnote: Fractional Gaussian noise is the increments process of fBm (Mandelbrot and Ness). 

6. Related work 

Earlier work has studied the problem of designing optimal samples of size n to 
linearly estimate the sum total of a process. For a one dimensional process which 
is wide-sense stationary with positive and convex correlation, within a class of 
unbiased estimators of the sum of the population, it was shown that systematic 
sampling of the process (uniform patterns with random starting points) is optimal 
(Hajek). 

For a two dimensional process on an $n_1 \times n_2$ grid with positive and convex 
correlation it was shown that an optimal sampling scheme does not lie in the class 
of schemes that ensure equal inclusion probability of $n/(n_1 n_2)$ for every point on 
the grid (Bellhouse). In Bellhouse, an "optimal scheme" refers to a sampling 
scheme that achieves a particular lower bound on the error variance. The require- 
ment of equal inclusion probability guarantees an unbiased estimator. The optimal 
schemes within certain sub-classes of this larger "equal inclusion probability" class 
were obtained using systematic sampling. More recent analysis refines these results 
to show that optimal designs do exist in the equal inclusion probability class for 
certain values of $n$, $n_1$, and $n_2$ and are obtained by Latin square sampling (Lawry 
and Bellhouse [10], Salehi). 

Our results differ from the above works in that we provide optimal solutions for 
the entire class of linear estimators and study a different set of random processes. 

Other work on sampling fractional Brownian motion to estimate its Hurst pa- 
rameter demonstrated that geometric sampling is superior to uniform sampling 
(Vidacs and Virtamo). 

Recent work compared different probing schemes for traffic estimation through 
numerical simulations (He and Hou [7]). It was shown that a scheme which used 
uniformly spaced probes outperformed other schemes that used clustered probes. 
These results are similar to our findings for independent innovations trees and 
covariance trees with positive correlation progression. 

7. Conclusions 

This paper has addressed the problem of obtaining optimal leaf sets to estimate the 
root node of two types of multiscale stochastic processes: independent innovations 
trees and covariance trees. Our findings are particularly useful for applications 
which require the estimation of the sum total of a correlated population from a 
finite sample. 

We have proved for an independent innovations tree that the optimal solution 
can be obtained using an efficient water-filling algorithm. Our results show that 
the optimal solutions can vary drastically depending on the correlation structure of 
the tree. For covariance trees with positive correlation progression as well as scale- 
invariant trees we obtained that uniformly spaced leaf nodes are optimal. However, 
uniform leaf nodes give the worst estimates for covariance trees with negative cor- 
relation progression. Numerical experiments support our conjecture that clustered 
nodes provide the optimal solution for covariance trees with negative correlation 
progression. 

This paper raises several interesting questions for future research. The general 
problem of determining which n random variables from a given set provide the best 
linear estimate of another random variable that is not in the same set is an 
NP-hard problem. We, however, devised a fast polynomial-time algorithm to solve one 
problem of this type, namely determining the optimal leaf set for an independent 
innovations tree. Clearly, the structure of independent innovations trees was an 
important factor that enabled a fast algorithm. The question arises as to whether 
there are similar problems that have polynomial-time solutions. 

We have proved optimal results for covariance trees by reducing the problem to 
one for independent innovations trees. Such techniques of reducing one optimization 
problem to another problem that has an efficient solution can be very powerful. If a 
problem can be reduced to one of determining optimal leaf sets for independent in- 
novations trees in polynomial-time, then its solution is also polynomial-time. Which 
other problems are amenable to this reduction is an open question. 

Appendix 

Proof of Lemma 3.1. We first prove the following statement. 

Claim (1): If there exists $X^* = [x_k^*] \in \Delta_n(M_1, \ldots, M_P)$ that has the following 
property: 

(7.1)    $\psi_i(x_i^*) - \psi_i(x_i^* - 1) \ge \psi_j(x_j^* + 1) - \psi_j(x_j^*)$, 

$\forall i \ne j$ such that $x_i^* > 0$ and $x_j^* < M_j$, then 

(7.2)    $h(n) = \sum_{k=1}^{P} \psi_k(x_k^*)$. 

We then prove that such an $X^*$ always exists and can be constructed using the 
water-filling technique. 

Consider any $\hat{X} \in \Delta_n(M_1, \ldots, M_P)$. Using the following steps, we transform the 
vector $\hat{X}$ two elements at a time to obtain $X^*$. 

Step 1: (Initialization) Set $X = \hat{X}$. 

Step 2: If $X \ne X^*$, then since the elements of both $X$ and $X^*$ sum up to $n$, there 
must exist a pair $i, j$ such that $x_i \ne x_i^*$ and $x_j \ne x_j^*$. Without loss of generality 
assume that $x_i < x_i^*$ and $x_j > x_j^*$. This assumption implies that $x_i^* > 0$ and 
$x_j^* < M_j$. Now form vector $Y$ such that $y_i = x_i + 1$, $y_j = x_j - 1$, and $y_k = x_k$ for 
$k \ne i, j$. From (7.1) and the concavity of $\psi_i$ and $\psi_j$ we have 

(7.3)    $\psi_i(y_i) - \psi_i(x_i) = \psi_i(x_i + 1) - \psi_i(x_i) \ge \psi_i(x_i^*) - \psi_i(x_i^* - 1)$ 
         $\ge \psi_j(x_j^* + 1) - \psi_j(x_j^*) \ge \psi_j(x_j) - \psi_j(x_j - 1)$ 
         $\ge \psi_j(x_j) - \psi_j(y_j)$. 

As a consequence 

(7.4)    $\sum_k (\psi_k(y_k) - \psi_k(x_k)) = \psi_i(y_i) - \psi_i(x_i) + \psi_j(y_j) - \psi_j(x_j) \ge 0$. 

Step 3: If $Y \ne X^*$ then set $X = Y$ and repeat Step 2, otherwise stop. 

After performing the above steps at most $\sum_k M_k$ times, $Y = X^*$ and (7.4) gives 

(7.5)    $\sum_k \psi_k(x_k^*) = \sum_k \psi_k(y_k) \ge \sum_k \psi_k(\hat{x}_k)$. 

This proves Claim (1). 

Indeed for any $\tilde{X} \ne X^*$ satisfying (7.1) we must have $\sum_k \psi_k(\tilde{x}_k) = \sum_k \psi_k(x_k^*)$. 
We now prove the following claim by induction. 

Claim (2): $G^{(n)} \in \Delta_n(M_1, \ldots, M_P)$ and $G^{(n)}$ satisfies (7.1). 
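(Initial Condition) The claim is trivial for $n = 0$. 

(Induction Step) Clearly from the water-filling recursion that defines $G^{(n+1)}$ from $G^{(n)}$ 

(7.6)    $\sum_k g_k^{(n+1)} = 1 + \sum_k g_k^{(n)} = n + 1$, 

and $0 \le g_k^{(n+1)} \le M_k$. Thus $G^{(n+1)} \in \Delta_{n+1}(M_1, \ldots, M_P)$. We now prove that 
$G^{(n+1)}$ satisfies property (7.1). We need to consider pairs $i, j$ as in (7.1) for which 
either $i = m$ or $j = m$ because all other cases directly follow from the fact that 
$G^{(n)}$ satisfies (7.1). 

Case (i) $j = m$, where $m$ is defined as in (3.5). Assuming that $g_m^{(n+1)} < M_m$, for 
all $i \ne m$ such that $g_i^{(n+1)} > 0$ we have 

(7.7)    $\psi_i(g_i^{(n+1)}) - \psi_i(g_i^{(n+1)} - 1) = \psi_i(g_i^{(n)}) - \psi_i(g_i^{(n)} - 1)$ 
         $\ge \psi_m(g_m^{(n)} + 1) - \psi_m(g_m^{(n)})$ 
         $\ge \psi_m(g_m^{(n)} + 2) - \psi_m(g_m^{(n)} + 1)$ 
         $= \psi_m(g_m^{(n+1)} + 1) - \psi_m(g_m^{(n+1)})$. 

Case (ii) $i = m$. Consider $j \ne m$ such that $g_j^{(n+1)} < M_j$. We have from the choice of $m$ 
that 

(7.8)    $\psi_m(g_m^{(n+1)}) - \psi_m(g_m^{(n+1)} - 1) = \psi_m(g_m^{(n)} + 1) - \psi_m(g_m^{(n)})$ 
         $\ge \psi_j(g_j^{(n)} + 1) - \psi_j(g_j^{(n)})$, 

where $g_j^{(n)} = g_j^{(n+1)}$ since $j \ne m$. Thus Claim (2) is proved. 

It only remains to prove the next claim. 

Claim (3): $h(n)$, or equivalently $\sum_k \psi_k(g_k^{(n)})$, is non-decreasing and discrete-concave. 

Since $\psi_k$ is non-decreasing for all $k$, from (3.4) we have that $\sum_k \psi_k(g_k^{(n)})$ is a 
non-decreasing function of $n$. We have from the water-filling update that 

(7.9)    $h(n + 1) - h(n) = \sum_k (\psi_k(g_k^{(n+1)}) - \psi_k(g_k^{(n)}))$ 
         $= \max_{k : g_k^{(n)} < M_k} \{ \psi_k(g_k^{(n)} + 1) - \psi_k(g_k^{(n)}) \}$. 

From the concavity of $\psi_k$ and the fact that $g_k^{(n+1)} \ge g_k^{(n)}$ we have that 

(7.10)   $\psi_k(g_k^{(n)} + 1) - \psi_k(g_k^{(n)}) \ge \psi_k(g_k^{(n+1)} + 1) - \psi_k(g_k^{(n+1)})$ 

for all $k$. Thus from (7.9) and (7.10), $h(n)$ is discrete-concave. □ 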
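
The water-filling construction used in this proof can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not the authors' Matlab implementation: `water_fill`, `objective`, `brute_force`, and the example functions are hypothetical names, and each function in `psis` is assumed to be non-decreasing and discrete-concave on {0, ..., M_k}. Starting from the zero vector, each step adds one unit to the coordinate with the largest marginal gain; the brute-force search over all feasible vectors is included only as a comparison on small cases.

```python
import math
from itertools import product


def water_fill(psis, caps, n):
    """Greedy allocation: n times, increment the coordinate with the largest gain."""
    if n > sum(caps):
        raise ValueError("n exceeds the total capacity")
    x = [0] * len(psis)
    for _ in range(n):
        # Coordinates that may still grow (x_k < M_k).
        open_ks = [k for k in range(len(psis)) if x[k] < caps[k]]
        m = max(open_ks, key=lambda k: psis[k](x[k] + 1) - psis[k](x[k]))
        x[m] += 1
    return x


def objective(psis, x):
    """Value of sum_k psi_k(x_k)."""
    return sum(psi(v) for psi, v in zip(psis, x))


def brute_force(psis, caps, n):
    """Exhaustive maximum of the objective over feasible vectors summing to n."""
    feasible = (x for x in product(*(range(c + 1) for c in caps)) if sum(x) == n)
    return max(objective(psis, x) for x in feasible)


# Example concave, non-decreasing functions (chosen only for illustration).
psis = [math.sqrt, lambda v: 2.0 * math.log1p(v), lambda v: 1.0 - 0.5**v]
caps = [3, 4, 2]

if __name__ == "__main__":
    for n in range(sum(caps) + 1):
        x = water_fill(psis, caps, n)
        print(n, x, round(objective(psis, x), 4))
```

The greedy loop evaluates each marginal gain once per step, which is what makes the approach cheap compared with enumerating every feasible vector.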

Proof of Corollary 3.1. Set $x_k^* = \lfloor n/P \rfloor$ for $1 \le k \le P - n + P \lfloor n/P \rfloor$ and $x_k^* = 1 + \lfloor n/P \rfloor$ 
for all other $k$. Then $X^* = [x_k^*] \in \Delta_n(M_1, \ldots, M_P)$ and $X^*$ satisfies (7.1) from 
which the result follows. □ 

The following two lemmas are required to prove Theorem 3.1. 

Lemma 7.1. Given independent random variables $A$, $W$, $F$, define $Z$ and $E$ through 
$Z := \zeta A + W$ and $E := \eta Z + F$ where $\zeta$, $\eta$ are constants. We then have the result 

(7.11)   $\frac{\mathrm{var}(A)}{\mathrm{cov}(A, E)^2} \cdot \frac{\mathrm{cov}(Z, E)^2}{\mathrm{var}(Z)} = \frac{\zeta^2 + \mathrm{var}(W)/\mathrm{var}(A)}{\zeta^2} \ge 1$. 



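Proof. Without loss of generality assume all random variables have zero mean. We 
have 

(7.12)   $\mathrm{cov}(E, Z) = \mathrm{E}(\eta Z^2 + FZ) = \eta\, \mathrm{var}(Z)$, 

(7.13)   $\mathrm{cov}(A, E) = \mathrm{E}((\eta(\zeta A + W) + F)A) = \zeta \eta\, \mathrm{var}(A)$, 

and 

(7.14)   $\mathrm{var}(Z) = \mathrm{E}(\zeta^2 A^2 + W^2 + 2 \zeta A W) = \zeta^2 \mathrm{var}(A) + \mathrm{var}(W)$. 

Thus from (7.12), (7.13) and (7.14) 

(7.15)   $\frac{\mathrm{cov}(Z, E)^2}{\mathrm{var}(Z)} \cdot \frac{\mathrm{var}(A)}{\mathrm{cov}(A, E)^2} = \frac{\eta^2 \mathrm{var}(Z)}{\zeta^2 \eta^2 \mathrm{var}(A)} = \frac{\zeta^2 + \mathrm{var}(W)/\mathrm{var}(A)}{\zeta^2} \ge 1$. 

□ 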
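
The identity in Lemma 7.1 can be checked numerically without simulation. The sketch below is an illustration only, with hypothetical names and arbitrary parameter values: it builds the exact covariance matrix of (A, Z, E) by writing that vector as a linear map of the independent vector (A, W, F), then compares the covariance ratio on the left of (7.11) with the closed form on the right.

```python
import numpy as np


def lemma_7_1_ratio(var_a, var_w, var_f, zeta, eta):
    """Return both sides of (7.11) for given variances and nonzero constants."""
    # (A, Z, E) = m @ (A, W, F) with Z = zeta*A + W and E = eta*Z + F.
    m = np.array([[1.0, 0.0, 0.0],
                  [zeta, 1.0, 0.0],
                  [eta * zeta, eta, 1.0]])
    cov = m @ np.diag([var_a, var_w, var_f]) @ m.T  # covariance of (A, Z, E)
    lhs = cov[0, 0] * cov[1, 2] ** 2 / (cov[0, 2] ** 2 * cov[1, 1])
    rhs = (zeta**2 + var_w / var_a) / zeta**2
    return lhs, rhs


if __name__ == "__main__":
    print(lemma_7_1_ratio(2.0, 0.5, 1.3, 0.7, -1.1))
```

Both constants must be nonzero for the ratio to be defined, since `cov(A, E)` appears in the denominator.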



Lemma 7.2. Given a positive function $z_i$, $i \in \mathbb{Z}$ and constant $\alpha > 0$ such that 

(7.16)   $r_i := \frac{1}{1 - \alpha z_i}$ 

is positive, discrete-concave, and non-decreasing, we have that 

(7.17)   $\delta_i := \frac{1}{1 - \beta z_i}$ 

is also positive, discrete-concave, and non-decreasing for all $\beta$ with $0 < \beta \le \alpha$. 

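Proof. Define $\kappa_i := z_i - z_{i-1}$. Since $z_i$ is positive and $r_i$ is positive and non-decreasing, 
$\alpha z_i < 1$ and $z_i$ must increase with $i$, that is $\kappa_i \ge 0$. This combined with 
the fact that $\beta z_i \le \alpha z_i < 1$ guarantees that $\delta_i$ must be positive and non-decreasing. 

It only remains to prove the concavity of $\delta_i$. From (7.16) 

(7.18)   $r_{i+1} - r_i = \frac{\alpha (z_{i+1} - z_i)}{(1 - \alpha z_{i+1})(1 - \alpha z_i)} = \alpha \kappa_{i+1} r_{i+1} r_i$. 

We are given that $r_i$ is discrete-concave, that is 

(7.19)   $0 \ge (r_{i+2} - r_{i+1}) - (r_{i+1} - r_i) = \alpha r_i r_{i+1} \left[ \kappa_{i+2} \frac{1 - \alpha z_i}{1 - \alpha z_{i+2}} - \kappa_{i+1} \right]$. 

Since $r_i > 0$ $\forall i$, we must have 

(7.20)   $\kappa_{i+2} \frac{1 - \alpha z_i}{1 - \alpha z_{i+2}} - \kappa_{i+1} \le 0$. 

Similar to (7.19) we have that 

(7.21)   $(\delta_{i+2} - \delta_{i+1}) - (\delta_{i+1} - \delta_i) = \beta \delta_i \delta_{i+1} \left[ \kappa_{i+2} \frac{1 - \beta z_i}{1 - \beta z_{i+2}} - \kappa_{i+1} \right]$. 

Since $\delta_i > 0$ $\forall i$, for the concavity of $\delta_i$ it suffices to show that 

(7.22)   $\kappa_{i+2} \frac{1 - \beta z_i}{1 - \beta z_{i+2}} - \kappa_{i+1} \le 0$. 

Now 

(7.23)   $\frac{1 - \alpha z_i}{1 - \alpha z_{i+2}} - \frac{1 - \beta z_i}{1 - \beta z_{i+2}} = \frac{(\alpha - \beta)(z_{i+2} - z_i)}{(1 - \alpha z_{i+2})(1 - \beta z_{i+2})} \ge 0$. 

Then (7.20) and (7.23) combined with the fact that $\kappa_i \ge 0$, $\forall i$ proves (7.22). □ 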
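Proof of Theorem 3.1. We split the theorem into three claims. 

Claim (1): $L^* := \cup_k L^{(\gamma k)}(x_k^*) \in \mathcal{L}_\gamma(n)$. 

From (3.10), (3.11), and (3.12) we obtain 

(7.24)   $\mu_\gamma(n) + \frac{P_\gamma - 1}{\mathrm{var}(V_\gamma)} = \max_{L \in \Lambda_\gamma(n)} \sum_{k=1}^{P_\gamma} \mathcal{E}(V_\gamma|L_{\gamma k})^{-1} \le \max_{X \in \Delta_n(N_{\gamma 1}, \ldots, N_{\gamma P_\gamma})} \sum_{k=1}^{P_\gamma} \mu_{\gamma,\gamma k}(x_k)$. 

Clearly $L^* \in \Lambda_\gamma(n)$. We then have from (3.10) and (3.11) that 

(7.25)   $\mu_\gamma(n) \ge \mathcal{E}(V_\gamma|L^*)^{-1} = \max_{X \in \Delta_n(N_{\gamma 1}, \ldots, N_{\gamma P_\gamma})} \sum_{k=1}^{P_\gamma} \mu_{\gamma,\gamma k}(x_k) - \frac{P_\gamma - 1}{\mathrm{var}(V_\gamma)}$. 

Thus from (7.24) and (7.25) we have 

(7.26)   $\mu_\gamma(n) = \mathcal{E}(V_\gamma|L^*)^{-1} = \max_{X \in \Delta_n(N_{\gamma 1}, \ldots, N_{\gamma P_\gamma})} \sum_{k=1}^{P_\gamma} \mu_{\gamma,\gamma k}(x_k) - \frac{P_\gamma - 1}{\mathrm{var}(V_\gamma)}$, 

which proves Claim (1). 

Claim (2): If $L \in \mathcal{L}_{\gamma k}(n)$ then $L \in \mathcal{L}_{\gamma,\gamma k}(n)$ and vice versa. 

Denote an arbitrary leaf node of the tree of $\gamma k$ as $E$. Then $V_\gamma$, $V_{\gamma k}$, and $E$ are 
related through 

(7.27)   $V_{\gamma k} = \varrho_{\gamma k} V_\gamma + W_{\gamma k}$, 

and 

(7.28)   $E = \eta V_{\gamma k} + F$ 

where $\eta$ and $\varrho_{\gamma k}$ are scalars and $W_{\gamma k}$, $F$ and $V_\gamma$ are independent random variables. 
We note that by definition $\mathrm{var}(V_\gamma) > 0$ $\forall \gamma$ (see Definition 2.5). From Lemma 7.1 
we have 

(7.29)   $\frac{\mathrm{cov}(V_{\gamma k}, E)}{\mathrm{cov}(V_\gamma, E)} = \left( \frac{\mathrm{var}(V_{\gamma k})}{\mathrm{var}(V_\gamma)} \right)^{1/2} \left( \frac{\varrho_{\gamma k}^2 + \mathrm{var}(W_{\gamma k})/\mathrm{var}(V_\gamma)}{\varrho_{\gamma k}^2} \right)^{1/2} =: \xi_{\gamma,k} \ge \left( \frac{\mathrm{var}(V_{\gamma k})}{\mathrm{var}(V_\gamma)} \right)^{1/2}$. 

From (7.29) we see that $\xi_{\gamma,k}$ is not a function of $E$. 

Denote the covariance between $V_\gamma$ and leaf node vector $L = [\ell_i] \in \Lambda_{\gamma k}(n)$ as 
$\theta_{\gamma,L} = [\mathrm{cov}(V_\gamma, \ell_i)]^T$ and that between $V_{\gamma k}$ and $L$ as 
$\theta_{\gamma k,L} = [\mathrm{cov}(V_{\gamma k}, \ell_i)]^T$. Then (7.29) gives 

(7.30)   $\theta_{\gamma k,L} = \xi_{\gamma,k} \theta_{\gamma,L}$. 

From (4.2) we have 

(7.31)   $\mathcal{E}(V_\gamma|L) = \mathrm{var}(V_\gamma) - \varphi(\gamma, L)$ 

where $\varphi(\gamma, L) = \theta_{\gamma,L}^T Q_L^{-1} \theta_{\gamma,L}$. Note that $\varphi(\gamma, L) \ge 0$ since $Q_L^{-1}$ is positive 
semi-definite. Using (7.30) we similarly get 

(7.32)   $\mathcal{E}(V_{\gamma k}|L) = \mathrm{var}(V_{\gamma k}) - \xi_{\gamma,k}^2 \varphi(\gamma, L)$. 

From (7.31) and (7.32) we see that $\mathcal{E}(V_\gamma|L)$ and $\mathcal{E}(V_{\gamma k}|L)$ are both minimized over 
$L \in \Lambda_{\gamma k}(n)$ by the same leaf vector that maximizes $\varphi(\gamma, L)$. This proves Claim (2). 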

Claim (3): $\mu_{\gamma,\gamma k}(n)$ is a positive, non-decreasing, and discrete-concave function 
of $n$, $\forall k, \gamma$. 

We start at a node $\gamma$ at one scale from the bottom of the tree and then move up 
the tree. 

Initial Condition: Note that $V_{\gamma k}$ is a leaf node. From (2.1) and (??) we obtain 

(7.33)   $\mathcal{E}(V_\gamma|V_{\gamma k}) = \mathrm{var}(V_\gamma) - \frac{(\varrho_{\gamma k} \mathrm{var}(V_\gamma))^2}{\mathrm{var}(V_{\gamma k})} \le \mathrm{var}(V_\gamma)$. 

For our choice of $\gamma$, $\mu_{\gamma,\gamma k}(1)$ corresponds to $\mathcal{E}(V_\gamma|V_{\gamma k})^{-1}$ and $\mu_{\gamma,\gamma k}(0)$ corresponds 
to $1/\mathrm{var}(V_\gamma)$. Thus from (7.33), $\mu_{\gamma,\gamma k}(n)$ is positive, non-decreasing, and 
discrete-concave (trivially since $n$ takes only two values here). 

Induction Step: Given that $\mu_{\gamma,\gamma k}(n)$ is a positive, non-decreasing, and 
discrete-concave function of $n$ for $k = 1, \ldots, P_\gamma$, we prove the same when $\gamma$ is replaced by 
$\gamma\uparrow$. Without loss of generality choose $k$ such that $(\gamma\uparrow)k = \gamma$. From (3.11), (3.13), 
(7.31), (7.32) and Claim (2), we have for $L \in \mathcal{L}_\gamma(n)$ 

(7.34)   $\mu_\gamma(n) = \frac{1}{\mathrm{var}(V_\gamma)} \cdot \frac{1}{1 - \frac{\varphi(\gamma, L)}{\mathrm{var}(V_\gamma)}}$   and   $\mu_{\gamma\uparrow,\gamma}(n) = \frac{1}{\mathrm{var}(V_{\gamma\uparrow})} \cdot \frac{1}{1 - \frac{\varphi(\gamma, L)}{\xi_{\gamma\uparrow,k}^2 \mathrm{var}(V_{\gamma\uparrow})}}$. 

From (7.26), the assumption that $\mu_{\gamma,\gamma k}(n)$ $\forall k$ is a positive, non-decreasing, and 
discrete-concave function of $n$, and Lemma 3.1 we have that $\mu_\gamma(n)$ is a 
non-decreasing and discrete-concave function of $n$. Note that by definition (see (3.11)) 
$\mu_\gamma(n)$ is positive. This combined with (2.1), (7.34), (7.29) and Lemma 7.2 then 
prove that $\mu_{\gamma\uparrow,\gamma}(n)$ is also positive, non-decreasing, and discrete-concave. □ 



We now prove a lemma to be used to prove Theorem 4.2. As a first step we 
compute the leaf arrangements $L$ which maximize and minimize the sum of all 
elements of $Q_L = [q_{i,j}(L)]$. We restrict our analysis to a covariance tree with depth 
$D$ and in which each node (excluding leaf nodes) has $\sigma$ child nodes. We introduce 
some notation. Define 

(7.35)   $\Gamma^{(u)}(p) := \{L : L \in \Lambda_{\emptyset}(\sigma^p)$ and $L$ is a uniform leaf node set$\}$ and 

(7.36)   $\Gamma^{(c)}(p) := \{L : L$ is a clustered leaf set of a node at scale $D - p\}$ 

for $p = 0, 1, \ldots, D$. We number nodes at scale $m$ in an arbitrary order from $q = 
0, 1, \ldots, \sigma^m - 1$ and refer to a node by the pair $(m, q)$. 

Lemma 7.3. Assume a positive correlation progression. Then, $\sum_{i,j} q_{i,j}(L)$ is 
minimized over $L \in \Lambda_{\emptyset}(\sigma^p)$ by every $L \in \Gamma^{(u)}(p)$ and maximized by every $L \in \Gamma^{(c)}(p)$. 
For a negative correlation progression, $\sum_{i,j} q_{i,j}(L)$ is maximized by every $L \in 
\Gamma^{(u)}(p)$ and minimized by every $L \in \Gamma^{(c)}(p)$. 

Proof. Set $p$ to be an arbitrary element in $\{1, \ldots, D - 1\}$. The cases of $p = 0$ and 
$p = D$ are trivial. Let $\vartheta_m = \#\{q_{i,j}(L) \in Q_L : q_{i,j}(L) = c_m\}$ be the number of 
elements of $Q_L$ equal to $c_m$. Define $a_m := \sum_{k=0}^{m} \vartheta_k$, $m \ge 0$ and set $a_{-1} = 0$. Then 

(7.37)   $\sum_{i,j} q_{i,j} = \sum_{m=0}^{D} c_m \vartheta_m = \sum_{m=0}^{D-1} c_m (a_m - a_{m-1}) + c_D \vartheta_D$ 
         $= \sum_{m=0}^{D-1} c_m a_m - \sum_{m=-1}^{D-2} c_{m+1} a_m + c_D \vartheta_D$ 
         $= \sum_{m=0}^{D-2} (c_m - c_{m+1}) a_m + c_{D-1} a_{D-1} - c_0 a_{-1} + c_D \vartheta_D$ 
         $= \sum_{m=0}^{D-2} (c_m - c_{m+1}) a_m + \mathrm{constant}$, 

where we used the fact that $a_{D-1} = a_D - \vartheta_D$ is a constant independent of the 
choice of $L$, since $\vartheta_D = \sigma^p$ and $a_D = \sigma^{2p}$. 

We now show that $L \in \Gamma^{(u)}(p)$ maximizes $a_m$, $\forall m$ while $L \in \Gamma^{(c)}(p)$ minimizes 
$a_m$, $\forall m$. First we prove the results for $L \in \Gamma^{(u)}(p)$. Note that $L$ has one element in 
the tree of every node at scale $p$. 

Case (i) $m \ge p$. Since every element of $L$ has proximity at most $p - 1$ with all other 
elements, $a_m = \sigma^{2p} - \sigma^p$ which is the maximum value it can take. 

Case (ii) $m < p$ (assuming $p > 0$). Consider an arbitrary ordering of nodes at scale 
$m + 1$. We refer to the $q$-th node in this ordering as "the $q$-th node at scale $m + 1$". 

Let the number of elements of $L$ belonging to the sub-tree of the $q$-th node at 
scale $m + 1$ be $g_q$, $q = 0, \ldots, \sigma^{m+1} - 1$. We have 

(7.38)   $a_m = \sum_{q=0}^{\sigma^{m+1}-1} g_q (\sigma^p - g_q) = \frac{\sigma^{2p+1+m}}{4} - \sum_{q=0}^{\sigma^{m+1}-1} \left( g_q - \frac{\sigma^p}{2} \right)^2$ 

since every element of $L$ in the tree of the $q$-th node at scale $m + 1$ must have 
proximity at most $m$ with all nodes not in the same tree but must have proximity 
at least $m + 1$ with all nodes within the same tree. 

The choice of $g_q$'s is constrained to lie on the hyperplane $\sum_q g_q = \sigma^p$. Obviously 
the quadratic form of (7.38) is maximized by the point on this hyperplane closest to 
the point $(\sigma^p/2, \ldots, \sigma^p/2)$ which is $(\sigma^{p-m-1}, \ldots, \sigma^{p-m-1})$. This is clearly achieved 
by $L \in \Gamma^{(u)}(p)$. 

Now we prove the results for $L \in \Gamma^{(c)}(p)$. 

Case (i) $m < D - p$. We have $a_m = 0$, the smallest value it can take. 

Case (ii) $D - p \le m < D$. Consider leaf node $\ell_i \in L$ which without any loss of 
generality belongs to the tree of first node at scale $m + 1$. Let $a_m(\ell_i)$ be the number 
of elements of $L$ to which $\ell_i$ has proximity less than or equal to $m$. Now since $\ell_i$ has 
proximity less than or equal to $m$ only with those elements of $L$ not in the same 
tree, we must have $a_m(\ell_i) \ge \sigma^p - \sigma^{D-m-1}$. Since $L \in \Gamma^{(c)}(p)$ achieves this lower 
bound for $a_m(\ell_i)$, $\forall i$ and $a_m = \sum_i a_m(\ell_i)$, $L \in \Gamma^{(c)}(p)$ minimizes $a_m$ in turn. □ 



We now study to what extent the above results transfer to the actual matrix of 
interest $Q_L^{-1}$. We start with a useful lemma. 



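Lemma 7.4. Denote the eigenvalues of $Q_L$ by $\lambda_j$, $j = 1, \ldots, \sigma^p$. Assume that no 
leaf node of the tree can be expressed as a linear combination of other leaf nodes, 
implying that $\lambda_j > 0$, $\forall j$. Set $D_L = [d_{i,j}]_{\sigma^p \times \sigma^p} := Q_L^{-1}$. Then there exist positive 
numbers $f_i$ with $f_1 + \ldots + f_{\sigma^p} = 1$ such that 

(7.39)   $\sum_{i,j=1}^{\sigma^p} q_{i,j} = \sigma^p \sum_{j=1}^{\sigma^p} f_j \lambda_j$, and 

(7.40)   $\sum_{i,j=1}^{\sigma^p} d_{i,j} = \sigma^p \sum_{j=1}^{\sigma^p} f_j / \lambda_j$. 

Furthermore, for both special cases, $L \in \Gamma^{(u)}(p)$ and $L \in \Gamma^{(c)}(p)$, we may choose 
the weights $f_j$ such that only one is non-zero. 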
the weights fj such that only one is non-zero. 

Proof. Since the matrix $Q_L$ is real and symmetric there exists an orthonormal 
eigenvector matrix $U = [u_{i,j}]$ that diagonalizes $Q_L$, that is $Q_L = U \Xi U^T$ where $\Xi$ 
is diagonal with eigenvalues $\lambda_j$, $j = 1, \ldots, \sigma^p$. Define $w_j := \sum_i u_{i,j}$. Then 

(7.41)   $\sum_{i,j} q_{i,j} = 1_{1 \times \sigma^p} Q_L 1_{\sigma^p \times 1} = (1_{1 \times \sigma^p} U) \Xi (1_{1 \times \sigma^p} U)^T$ 
         $= [w_1 \ldots w_{\sigma^p}] \Xi [w_1 \ldots w_{\sigma^p}]^T = \sum_j \lambda_j w_j^2$. 

Further, since $U^T = U^{-1}$ we have 

(7.42)   $\sum_j w_j^2 = (1_{1 \times \sigma^p} U)(U^T 1_{\sigma^p \times 1}) = 1_{1 \times \sigma^p} I 1_{\sigma^p \times 1} = \sigma^p$. 

Setting $f_i := w_i^2 / \sigma^p$ establishes (7.39). Using the decomposition 

(7.43)   $Q_L^{-1} = (U^T)^{-1} \Xi^{-1} U^{-1} = U \Xi^{-1} U^T$ 

similarly gives (7.40). 

Consider the case $L \in \Gamma^{(u)}(p)$. Since $L = [\ell_i]$ consists of a symmetrical set of leaf 
nodes (the set of proximities between any element $\ell_i$ and the rest does not depend 
on $i$) the sum of the covariances of a leaf node $\ell_i$ with its fellow leaf nodes does not 
depend on $i$, and we can set 

(7.44)   $\lambda^{(u)} := \sum_{j=1}^{\sigma^p} q_{i,j}(L) = c_D + \sum_{m=1}^{p} (\sigma^m - \sigma^{m-1}) c_{p-m}$. 

With the sum of the elements of any row of $Q_L$ being identical, the vector $1_{\sigma^p \times 1}$ 
is an eigenvector of $Q_L$ with eigenvalue $\lambda^{(u)}$ equal to (7.44). 

Recall that we can always choose a basis of orthogonal eigenvectors that includes 
$1_{\sigma^p \times 1}$ as the first basis vector. It is well known that the rows of the corresponding 
basis transformation matrix $U$ will then be exactly these normalized eigenvectors. 
Since they are orthogonal to $1_{\sigma^p \times 1}$, the sum of their coordinates $w_j$ ($j = 2, \ldots, \sigma^p$) 
must be zero. Thus, all $f_i$ but $f_1$ vanish. (The last claim follows also from the 
observation that the sum of coordinates of the normalized $1_{\sigma^p \times 1}$ equals $w_1 = 
\sigma^p \sigma^{-p/2} = \sigma^{p/2}$; due to (7.42) $w_j = 0$ for all other $j$.) 

Consider the case $L \in \Gamma^{(c)}(p)$. The reasoning is similar to the above, and we can 
define 

(7.45)   $\lambda^{(c)} := \sum_{j=1}^{\sigma^p} q_{i,j}(L) = c_D + \sum_{m=1}^{p} (\sigma^m - \sigma^{m-1}) c_{D-m}$. 

□ 
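
The spectral identities (7.39) and (7.40), and the Jensen-type bound they lead to, can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original analysis: `spectral_weights` is a hypothetical helper, the test matrix is an arbitrary symmetric positive definite matrix generated from a fixed seed, and the second matrix is a simple equicorrelated example with identical row sums.

```python
import numpy as np


def spectral_weights(q):
    """Eigenvalues of q and weights f_j = w_j^2 / n, with w_j the column sums of U."""
    n = q.shape[0]
    lam, u = np.linalg.eigh(q)  # q = u @ diag(lam) @ u.T, columns of u orthonormal
    w = u.sum(axis=0)           # w_j: sum of the coordinates of eigenvector j
    return lam, w**2 / n


rng = np.random.default_rng(0)
a = rng.standard_normal((6, 6))
q = a @ a.T + 6.0 * np.eye(6)   # arbitrary symmetric positive definite matrix
n = q.shape[0]
lam, f = spectral_weights(q)

sum_q = n * np.sum(f * lam)     # should match the sum of the elements of q
sum_q_inv = n * np.sum(f / lam) # should match the sum of the elements of inv(q)

# Equicorrelated matrix: every row has the same sum, so a single weight survives.
q_sym = 0.3 * np.ones((4, 4)) + 0.7 * np.eye(4)
lam_sym, f_sym = spectral_weights(q_sym)

if __name__ == "__main__":
    print(f.sum(), sum_q, q.sum(), sum_q_inv, np.linalg.inv(q).sum())
    print(np.round(f_sym, 6), lam_sym)
```

Because the weights sum to one, the convexity of 1/x on the positive eigenvalues gives a lower bound on the sum of the elements of the inverse in terms of the sum of the elements of the matrix itself.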

Proof of Theorem 4.2. Due to the special form of the covariance vector $\mathrm{cov}(L, V_{\emptyset}) = 
\rho 1_{1 \times \sigma^k}$ we observe from (4.2) that minimizing the LMMSE $\mathcal{E}(V_{\emptyset}|L)$ over $L \in \Lambda_{\emptyset}(n)$ 
is equivalent to maximizing $\sum_{i,j} d_{i,j}(L)$, the sum of the elements of $Q_L^{-1}$. 

Note that the weights $f_i$ and the eigenvalues $\lambda_i$ of Lemma 7.4 depend on the 
arrangement of the leaf nodes $L$. To avoid confusion, denote by $\lambda_i$ the eigenvalues of 
$Q_L$ for an arbitrary fixed set of leaf nodes $L$, and by $\lambda^{(u)}$ and $\lambda^{(c)}$ the only relevant 
eigenvalues of $L \in \Gamma^{(u)}(p)$ and $L \in \Gamma^{(c)}(p)$ according to (7.44) and (7.45). 

Assume a positive correlation progression, and let $L$ be an arbitrary set of $\sigma^p$ 
leaf nodes. Lemma 7.3 and Lemma 7.4 then imply that 

(7.46)   $\lambda^{(u)} \le \sum_j \lambda_j f_j \le \lambda^{(c)}$. 

Since $Q_L$ is positive definite, we must have $\lambda_j > 0$. We may then interpret the 
middle expression as an expectation of the positive "random variable" $\lambda$ with discrete 
law given by $f_i$. By Jensen's inequality, 

(7.47)   $\sum_j (1/\lambda_j) f_j \ge \frac{1}{\sum_j \lambda_j f_j} \ge \frac{1}{\lambda^{(c)}}$. 

Thus, $\sum_{i,j} d_{i,j}$ is minimized by $L \in \Gamma^{(c)}(p)$; that is, clustering of nodes gives the 
worst LMMSE. A similar argument holds for the negative correlation progression 
case which proves the Theorem. □ 